Abstract. Given a stream of items each associated with a numerical
value, its edit distance to monotonicity is the minimum number of items
to remove so that the remaining items are non-decreasing with respect to
the numerical value. The space complexity of estimating the edit distance
to monotonicity of a data stream has become well understood over the
past few years. Motivated by applications in network quality monitoring,
we extend the study to estimating the edit distance to monotonicity of
a sliding window covering the $w$ most recent items in the stream for any
$w \ge 1$. We give a deterministic algorithm which can return an estimate
within a factor of $(4 + \epsilon)$ using $O(\frac{1}{\epsilon^2} \log^2(\epsilon w))$ space.
We also extend the study in two directions. First, we consider a stream
where each item is associated with a value from a partially ordered set. We
give a randomized $(4 + \epsilon)$-approximate algorithm using
$O(\frac{1}{\epsilon^2} \log \epsilon^2 w \log w)$ space. Second, we consider an
out-of-order stream where each item is associated with a creation time and a
numerical value, and items may be out of order with respect to their creation
times. The goal is to estimate the edit distance to monotonicity with respect
to the numerical value of items arranged in the order of creation times. We
show that any randomized constant-approximate algorithm requires linear space.

1 Introduction 

Estimating the sortedness of a numerical sequence has found applications in, e.g.,
sorting algorithms, database management and webpage ranking (such as PageRank
[4]). For example, sorting algorithms can take advantage of knowing the
sortedness of a sequence so as to sort efficiently. In relational databases, many
operations are best performed when the relations are sorted or nearly sorted over
the relevant attributes [3]. Maintaining an estimate of the sortedness of the
relations can help determine whether a given relation is sufficiently nearly-sorted
or a sorting operation on the relation (which is expensive) is needed. One
common measurement of sortedness of a sequence is its edit distance to monotonicity
(or ED, in short) [2,7,8,10,11]: given a sequence $\sigma$ of $n$ items, each associated
with a value in $[m] = \{1, 2, \ldots, m\}$, the ED of $\sigma$, denoted by $ed(\sigma)$, is the
minimum number of edit operations required to transform $\sigma$ to the sequence obtained
by sorting $\sigma$ in non-decreasing order. Here, an edit operation involves removing
an item and re-inserting it into a new position of the sequence. Equivalently, $ed(\sigma)$
is the minimum number of items in $\sigma$ to delete so that the remaining items have
non-decreasing values. A closely related measurement is the length of the longest
increasing subsequence (or LIS) of $\sigma$, denoted by $lis(\sigma)$. It is not hard to see that
$lis(\sigma) = n - ed(\sigma)$.
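As a concrete illustration of the identity $lis(\sigma) = n - ed(\sigma)$, the following minimal Python sketch computes ED through the longest non-decreasing subsequence. It is an offline computation that stores the whole sequence, so it only illustrates the definition and is not a streaming algorithm; the function names are hypothetical.

```python
from bisect import bisect_right

def lnds_length(values):
    """Length of the longest non-decreasing subsequence (patience sorting)."""
    tails = []  # tails[p] = smallest possible last value of a subsequence of length p + 1
    for v in values:
        pos = bisect_right(tails, v)  # bisect_right lets equal values extend a run
        if pos == len(tails):
            tails.append(v)
        else:
            tails[pos] = v
    return len(tails)

def edit_distance_to_monotonicity(values):
    """ed(sigma) = n - lis(sigma): items to delete so that the rest is non-decreasing."""
    return len(values) - lnds_length(values)
```

For example, in the sequence 1, 3, 2, 4 it suffices to delete either 3 or 2, so the ED is 1.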

With the rapid advance of data collection technologies, the sequences usually 
appear in the form of a data stream, where the stream of items is massive in 
size (containing possibly billions of items) and the items are rapidly arriving 
sequentially. This gives rise to the problem of estimating ED in the data stream 
model: An algorithm is only allowed to scan the sequence sequentially in one 
pass, and it also needs to be able to return, at any time, an estimate on ED of 
the items arrived so far. The main concern is the space usage and update time 
per item arrival, which, ideally, should both be significantly smaller than the 
total data size (preferably polylogarithmic). 

Estimating ED of a data stream has become well understood over the past
few years [8,10,11]. Gopalan et al. [11] showed that computing the ED of a
stream exactly requires $\Omega(n)$ space even for randomized algorithms, where $n$
is the number of items arrived so far. They also gave a randomized $(4 + \epsilon)$-
approximate algorithm for estimating ED using space $O(\frac{1}{\epsilon^2} \log^2 n)$, where
$0 < \epsilon < 1$. Later, Ergun and Jowhari [8] improved the result by giving a deterministic
$(2 + \epsilon)$-approximate algorithm using space $O(\frac{1}{\epsilon^2} \log^2(\epsilon n))$. For the closely related
LIS problem, Gopalan et al. [11] also gave a deterministic $(1 + \epsilon)$-approximate
algorithm for estimating LIS using $O(\sqrt{n/\epsilon})$ space. This space bound is proven
to be optimal in [10].

ED in sliding windows. The above results consider the sortedness of all
items in the stream arrived so far, which corresponds to the whole stream model.
Recently, it has been suggested that ED can be an indicator of network quality [12]. The
items of the stream correspond to the packets transmitted through a network,
each associated with a sequence number. Ideally, the packets would arrive in
increasing order of the sequence number. Yet network congestion would result in
packet retransmission and distortion in the packet arrival order, which leads to
a large ED value. One of the main causes of network congestion is that traffic is
often bursty. Thus, the network quality can be measured more accurately if the
measurement is based on only recent traffic. To this end, we propose studying
the sliding window model where we estimate the ED of a window covering the
latest $w$ items in the stream. Here $w$ is a positive integer representing the window
size. The sliding window model is no easier than the whole data stream model
because when $w$ is set to be infinity, we need to estimate ED for all items arrived.

Our results. We give a deterministic $(4 + \epsilon)$-approximate algorithm for
estimating ED in a sliding window. The space usage is $O(\frac{1}{\epsilon^2} \log^2(\epsilon w))$, where $w$ is
the window size. Our algorithm is a generalization of the algorithm by Gopalan
et al. [11]. In particular, Gopalan et al. show that ED of the whole stream can be
approximated by the number of "inverted" items $j$ such that many items that arrived
before $j$ have a value bigger than that of $j$. We extend this definition to the sliding
window model. Yet, maintaining the number of inverted items in a sliding window
is non-trivial. An item $j$ may be inverted when it arrives, but it may become
not inverted due to the expiry of items that arrived earlier. We give an interesting
algorithm to estimate the number of inverted items using existing results on
basic counting and quantile estimation over sliding windows. Our algorithm also
incorporates an idea in to remove randomization. 

We also consider two extensions of the problem. 

• Partially ordered items. In some applications, each item arrived is associated
with multiple attributes, e.g., a network packet may contain both the IP address
of the sender and a sequence number. To measure the network quality, it
is sometimes useful to estimate the most congested traffic coming from a particular
sender. This corresponds to estimating the ED of packets with respect to
sequence number from the same sender IP address. In this case, only sequence
numbers with the same IP address can be ordered. We model such a situation
by considering items each associated with a value drawn from a partially ordered
universe. We are interested in estimating the minimum number of items to delete
so that the remaining items are sorted with respect to the partial order. We give
a randomized $(4 + \epsilon)$-approximate algorithm using $O(\frac{1}{\epsilon^2} \log \epsilon^2 w \log w)$ space.

• Out-of-order streams. When a sender transmits packets to a receiver through
a network, the packets will go through some intermediate routers. To measure
the quality of the route between the sender and an intermediate router, it is
desirable to estimate the ED of the packets received by the router from the
sender. Yet in some cases, the router may not be powerful enough to deploy
the algorithm for estimating the ED. We consider delegating the task of
estimation to the receiver. To model the situation, whenever a packet arrives, the
intermediate router marks in the packet a timestamp recording the number of
packets received thus far (which can be done by maintaining a single counter).
Hence, when the packets arrive at the receiver, each packet has both a sequence
number assigned by the sender and a timestamp marked by the router. Note
that the packets arriving at the receiver may be out of order with respect to the
timestamp. Such a stream corresponds to an out-of-order stream.

To measure the network quality between the sender and the router, the
receiver can estimate the ED with respect to the sequence number when the items
are arranged in increasing order of the timestamps. Intuitively, the problem is
difficult as items can be inserted in arbitrary positions of the sequence according
to the timestamp. We show strong space lower bounds even in the whole
stream model. In particular, any randomized constant-approximate algorithm
for estimating ED of an out-of-order stream requires $\Omega(n)$ space, where $n$ is the
number of items arrived so far. An identical lower bound holds for estimating
the LIS. Like most streaming lower bounds, our lower bounds are proved based
on reductions from two communication problems, namely, the Index problem
and the Disj problem. Optimal communication lower bounds for randomized
protocols are known for both problems [1,14].



Organization. Sections 2 and 3 give the formal problem definitions and our
main algorithm for estimating ED, respectively. Section 4 considers out-of-order
streams. Due to the page limit, the extension to partially ordered items is left to the
full paper.

2 Formal problem definitions 

Sortedness of a stream. Consider a stream $\sigma$ of $n$ items, $(\sigma(1), \sigma(2), \ldots, \sigma(n))$,
where each $\sigma(i)$ is drawn from $[m] = \{1, 2, \ldots, m\}$. The edit distance to
monotonicity (ED) of $\sigma$, denoted by $ed(\sigma)$, is the minimum number of items that must be
removed so as to obtain an increasing subsequence of $\sigma$, i.e., $(\sigma(i_1), \sigma(i_2), \ldots, \sigma(i_k))$
such that $\sigma(i_1) \le \sigma(i_2) \le \cdots \le \sigma(i_k)$ for some $1 \le i_1 < i_2 < \cdots < i_k \le n$.
We use $lis(\sigma)$ to denote the length of the longest increasing subsequence (LIS)
of $\sigma$. Note that $lis(\sigma) = n - ed(\sigma)$. The sortedness can be computed based on
the whole stream (all items in $\sigma$ received thus far) or a sliding window covering
the most recent $w$ items, denoted by $\sigma_w$, for $w \ge 1$. Note that the whole stream
model can be viewed as a special case of the sliding window model with window
size $w = \infty$. A streaming algorithm has only limited space and can only
maintain an estimate of the sortedness of $\sigma_w$. For any $r > 1$, an $r$-approximate
algorithm for estimating $ed(\sigma_w)$ returns, at any time, an estimate $\hat{ed}(\sigma_w)$ such
that $ed(\sigma_w) \le \hat{ed}(\sigma_w) \le r \cdot ed(\sigma_w)$. We can define an $r$-approximate algorithm
for estimating $lis(\sigma_w)$ similarly.
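To make the sliding window quantity and the $r$-approximation guarantee concrete, here is a small Python sketch that recomputes the exact ED of the latest $w$ items after every arrival and checks whether an estimate satisfies the two-sided bound above. It keeps the whole window in memory, so it is only a reference for testing; `window_ed` and `is_r_approximate` are hypothetical names.

```python
from bisect import bisect_right

def ed(values):
    """Exact ED: length minus the longest non-decreasing subsequence."""
    tails = []
    for v in values:
        pos = bisect_right(tails, v)
        if pos == len(tails):
            tails.append(v)
        else:
            tails[pos] = v
    return len(values) - len(tails)

def window_ed(stream, w):
    """Exact ed(sigma_w) after each arrival, for the window of the latest w items."""
    return [ed(stream[max(0, i - w + 1): i + 1]) for i in range(len(stream))]

def is_r_approximate(estimate, exact, r):
    """True when exact <= estimate <= r * exact, the guarantee defined above."""
    return exact <= estimate <= r * exact
```

With the stream 5, 1, 2, 3, 4 and $w = 3$, the large first item stops contributing once it expires from the window.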

Partially ordered universe. We also consider a partially ordered universe with
binary relation $\preceq$. A subsequence of $\sigma$ with length $\ell$, $(\sigma(i_1), \sigma(i_2), \ldots, \sigma(i_\ell))$,
is increasing if for any $k \in [\ell - 1]$, $\sigma(i_k) \preceq \sigma(i_{k+1})$. Then for any window size
$w \ge 1$, $ed(\sigma_w)$ and $lis(\sigma_w)$ can be defined analogously as before.
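The partially ordered definition can be illustrated with a quadratic-time dynamic program. The sketch below is only an exact offline reference, not the randomized streaming algorithm mentioned in the paper. The order `same_sender_leq` on (sender, sequence number) pairs is an assumed example in the spirit of the packet scenario from the introduction, and all names are hypothetical.

```python
def same_sender_leq(a, b):
    """Assumed partial order: pairs are comparable only when the sender matches."""
    return a[0] == b[0] and a[1] <= b[1]

def poset_lis(items, leq):
    """Longest subsequence whose consecutive items are related by leq."""
    best = []  # best[j] = longest increasing subsequence ending at items[j]
    for j, item in enumerate(items):
        longest_prefix = max(
            (best[p] for p in range(j) if leq(items[p], item)), default=0
        )
        best.append(1 + longest_prefix)
    return max(best, default=0)

def poset_ed(items, leq):
    """Minimum number of deletions leaving an increasing subsequence."""
    return len(items) - poset_lis(items, leq)
```

Because the relation is transitive, requiring consecutive items to be related is the same as requiring the kept items to form a chain, which is why the one-step recurrence is enough.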

Out-of-order stream. The data stream described above is an in-order stream,
which assumes items arriving in the same order as their creation time. In an
out-of-order stream, each item is associated with a distinct integral timestamp
recording its creation time, which may be different from its arrival time.
Precisely, an out-of-order stream $\sigma$ is a sequence of tuples $(t_i, v_i)$ ($i \in [n]$) where
$t_i$ and $v_i$ are the timestamp and value of the $i$-th item. The sortedness of $\sigma$
is defined based on the permuted sequence $V(\sigma) = (v_{i_1}, v_{i_2}, \ldots, v_{i_n})$ such that
$t_{i_1} < t_{i_2} < \cdots < t_{i_n}$, i.e., $ed(\sigma) := ed(V(\sigma))$ and $lis(\sigma) := lis(V(\sigma))$.
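The definition of $V(\sigma)$ amounts to sorting the arrived tuples by timestamp before measuring sortedness of the values. A minimal offline sketch in Python, assuming distinct timestamps as in the definition (the function name is hypothetical):

```python
from bisect import bisect_right

def out_of_order_ed(stream):
    """stream: list of (timestamp, value) tuples in arrival order."""
    values = [v for _, v in sorted(stream)]  # V(sigma): values in timestamp order
    tails = []
    for v in values:
        pos = bisect_right(tails, v)
        if pos == len(tails):
            tails.append(v)
        else:
            tails[pos] = v
    return len(values) - len(tails)  # n - lis(V(sigma))
```

Note that the arrival order itself plays no role in the result; only the timestamps and values do.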

3 A $(4 + \epsilon)$-approximate algorithm for estimating ED

In this section, we consider a stream $\sigma$ of items with values drawn from a set
$[m] = \{1, 2, \ldots, m\}$, and we are interested in estimating the ED of a sliding
window covering the most recent $w$ items in $\sigma$. We give a deterministic $(4 + \epsilon)$-
approximate algorithm which uses $O(\frac{1}{\epsilon^2} \log^2(\epsilon w))$ space.

Our algorithm is based on an estimator $R(i)$, which is a generalization of the
estimator in [11] to the sliding window model. Let $i$ be the index of the latest
arrived item. The sliding window we consider is
$\sigma_{[i-w+1,i]} = (\sigma(i-w+1), \sigma(i-w+2), \ldots, \sigma(i))$. For any item $\sigma(j)$, let $inv(j)$ be
the set of items that arrived before $\sigma(j)$ but have greater values than $\sigma(j)$, i.e.,
$inv(j) = \{k : k < j \text{ and } \sigma(k) > \sigma(j)\}$. We define an estimator $R(i)$ for
$ed(\sigma_{[i-w+1,i]})$ as follows.

Definition 1. Consider the current sliding window $\sigma_{[i-w+1,i]}$. We define $R(i)$
to be the set of indices $j \in [i-w+1, i]$ such that there exists $k \in [i-w+1, j-1]$
with $|[k, j-1] \cap inv(j)| > \frac{j-k}{2}$.

Lemma 1 ([11]). $ed(\sigma_{[i-w+1,i]})/2 \le |R(i)| \le 2 \cdot ed(\sigma_{[i-w+1,i]})$.

Hence, if we know $|R(i)|$, we can return $2|R(i)|$ as an estimation for $ed(\sigma_{[i-w+1,i]})$,
and it gives a 4-approximation algorithm. However, maintaining $R(i)$ exactly
requires space linear in the window size. In the following, we show how to
approximate $R(i)$ using significantly less space.
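For intuition, $R(i)$ can be computed by brute force when the whole window is available. The Python sketch below follows Definition 1 literally, reading the threshold as more than half of the interval $[k, j-1]$; it uses 0-based indices and linear space, so it is a reference implementation for small inputs only, and `inverted_set` is a hypothetical name.

```python
def inverted_set(stream, i, w):
    """R(i) for the window stream[i-w+1 .. i], with 0-based indices."""
    lo = max(0, i - w + 1)
    result = set()
    for j in range(lo, i + 1):
        greater = 0  # size of [k, j-1] intersected with inv(j), as k moves left
        for k in range(j - 1, lo - 1, -1):
            if stream[k] > stream[j]:
                greater += 1
            if 2 * greater > j - k:  # strict majority of the j - k items in [k, j-1]
                result.add(j)
                break
    return result
```

On the window 5, 1, 2, 3, 4 only the item with value 1 is inverted, so $|R(i)| = 1$, which lies between $ed/2 = 1/2$ and $2 \cdot ed = 2$ as Lemma 1 states.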



3.1 Estimating $R(i)$

We first present our algorithm and then show that it can approximate $R(i)$. Our
algorithm will make use of two data structures. Let $\epsilon'$ be a constant in $(0, 1)$
(which will be set to $\epsilon/35$ later).

$\epsilon'$-approximate quantile data structure $\mathcal{Q}$: Let $Q$ be a set of items. The
rank of an item in $Q$ is its position in the list formed by sorting $Q$ from the
smallest to the biggest. For any $\phi \in [0, 1]$, the $\epsilon'$-approximate $\phi$-quantile of $Q$ is
an item with rank in $[(\phi - \epsilon')|Q|, (\phi + \epsilon')|Q|]$. We maintain an $\epsilon'$-approximate
$\phi$-quantile data structure given in [13] which can return, at any time, an $\epsilon'$-
approximate $\phi$-quantile of the most recent $w'$ items for any $w' \le w$. This data
structure takes $O(\frac{1}{\epsilon'^2} \log^2(\epsilon' w))$ space.

$\epsilon'$-approximate basic counting data structure $\mathcal{B}$: When an item $\sigma(i)$
arrives, we may associate a token with some item $\sigma(k)$ where $k < i$. The
association is permanent and an item may be associated with more than one token.
At any time, we are interested in the number of tokens associated with the most
recent $w$ items. We view it as a stream $\sigma_{token}$ of tokens, each of which has a
timestamp $k$ if it is associated with $\sigma(k)$, and we want to return the number of
tokens with timestamp in $[i-w+1, i]$. Note that the tokens may be out of order
with respect to the timestamp, leading to the basic counting problem for
out-of-order streams considered in [6]. We maintain their $\epsilon'$-approximate basic counting
data structure on $\sigma_{token}$ which can return, at any time, an estimate $\hat{t}$ such that
$|\hat{t} - t| \le \epsilon' t$, where $t$ is the number of tokens associated with the latest $w$ items.
It takes $O(\frac{1}{\epsilon'} \log w \log(\frac{\epsilon' B}{\log w}))$ space, where $B$ is the maximum number of tokens
associated within any window of $w$ items. As we may associate one token upon
any item arrival, $B$ is at most $w$.

We are now ready to define our algorithm, as follows. 



Algorithm 1. Estimating ED in sliding windows
Item arrival: Upon the arrival of item $\sigma(i)$, do
  For $k = i-1, i-2, \ldots, i-w+1$
    Query $\mathcal{Q}$ for the $(\frac{1}{2} - \epsilon')$-quantile of $\sigma_{[k,i-1]}$. Let $a$ be the returned value.
    If $a > \sigma(i)$, associate a token to $\sigma(k)$, i.e., add an item with timestamp $k$
    to the stream $\sigma_{token}$. Break the for loop.
Query: Query $\mathcal{B}$ on the stream $\sigma_{token}$ for the number of tokens associated with
the last $w$ items and let $\hat{t}$ be the returned answer. Return $\hat{t} / ((\frac{1}{2} - 2\epsilon')(1 - \epsilon'))$ as
the estimation $\hat{ed}(\sigma_{[i-w+1,i]})$.
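The control flow of Algorithm 1 can be sketched in Python as follows. This is not the small-space algorithm itself: the two approximate data structures are replaced by exact stand-ins (sorting the interval for the quantile query and scanning a list of token timestamps for the count), which use linear space. The rank rule `max(1, ceil(phi * n))` for the $(\frac{1}{2} - \epsilon')$-quantile, the scaling factor, and the function name are assumptions made for this illustration.

```python
from fractions import Fraction
import math

def estimate_ed(stream, w, eps_prime=Fraction(1, 35)):
    """Return the estimate reported after each arrival (exact stand-ins for Q and B)."""
    phi = Fraction(1, 2) - eps_prime
    scale = (Fraction(1, 2) - 2 * eps_prime) * (1 - eps_prime)
    token_timestamps = []  # the token stream: one timestamp k per token
    estimates = []
    for i, value in enumerate(stream):
        lo = max(0, i - w + 1)
        for k in range(i - 1, lo - 1, -1):  # k = i-1, i-2, ..., i-w+1
            interval = sorted(stream[k:i])  # stand-in for a quantile query on [k, i-1]
            rank = max(1, math.ceil(phi * len(interval)))
            a = interval[rank - 1]
            if a > value:
                token_timestamps.append(k)  # associate a token to sigma(k)
                break
        # Stand-in for the basic counting query: tokens whose item is still in the window.
        t = sum(1 for k in token_timestamps if k >= lo)
        estimates.append(float(t / scale))
    return estimates
```

A token is attached to the left endpoint $k$ of the witnessing interval rather than to the arriving item, so that the token expires from the count exactly when $\sigma(k)$ leaves the window.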

Let $R'(i)$ be the set of indices $j$ such that when $\sigma(j)$ arrives, we associate
a token to an item $\sigma(k)$ where $k \in [i-w+1, j-1]$. Observe that $R'(i)$ is an
approximation of $R(i)$ in the following sense.

Lemma 2. $R'(i)$ contains all indices $j \in [i-w+1, i]$ satisfying that there exists
$k \in [i-w+1, j-1]$ such that $|[k, j-1] \cap inv(j)| > (\frac{1}{2} + 2\epsilon')(j-k)$. Furthermore,
all indices $j$ contained in $R'(i)$ must satisfy that there exists $k \in [i-w+1, j-1]$
such that $|[k, j-1] \cap inv(j)| > \frac{j-k}{2}$.

Proof. An index $j$ is in $R'(i)$ if $\sigma(j) < a$ when $\sigma(j)$ arrives, where $a$ is the $\epsilon'$-
approximate $(\frac{1}{2} - \epsilon')$-quantile for some interval $\sigma_{[k,j-1]}$. Note that the rank of $a$ in
$\sigma_{[k,j-1]}$ is at least $(\frac{1}{2} - 2\epsilon')(j-k)$. Therefore, if $|[k, j-1] \cap inv(j)| > (\frac{1}{2} + 2\epsilon')(j-k)$,
the rank of $\sigma(j)$ is less than $(j-k) - (\frac{1}{2} + 2\epsilon')(j-k) = (\frac{1}{2} - 2\epsilon')(j-k)$, so
$\sigma(j) < a$ and $j$ must be included in $R'(i)$. On the other hand, the rank of $a$ in
$\sigma_{[k,j-1]}$ is at most $\frac{j-k}{2}$. Since $a > \sigma(j)$, we conclude that all indices $j \in R'(i)$
satisfy $|[k, j-1] \cap inv(j)| > \frac{j-k}{2}$. □

We show that $|R'(i)|$ is a good approximation for $ed(\sigma_{[i-w+1,i]})$, as follows.

Lemma 3. $(\frac{1}{2} - 2\epsilon') \cdot ed(\sigma_{[i-w+1,i]}) \le |R'(i)| \le 2 \cdot ed(\sigma_{[i-w+1,i]})$.

Proof. We observe that by Lemma 2 any index $j$ in $R'(i)$ must also be in $R(i)$.
Hence, $R'(i) \subseteq R(i)$ and $|R'(i)| \le |R(i)| \le 2 \cdot ed(\sigma_{[i-w+1,i]})$ (by Lemma 1).

Now, we show $(\frac{1}{2} - 2\epsilon') \cdot ed(\sigma_{[i-w+1,i]}) \le |R'(i)|$ by giving an iterative pruning
procedure to obtain an increasing subsequence (which may not be the longest). First
let $x = i+1$ and $\sigma(x) = \infty$. Find the largest $j$ such that $i-w+1 \le j < x$ and
$j \notin R'(i) \cup inv(x)$ and delete the interval $[j+1, x-1]$. We then let $x = j$ and
repeat the process until no such $j$ is found. As each $x$ is not in $R'(i)$, Lemma 2
implies that in every interval that we delete, the fraction of items of $R'(i)$ is
at least $(\frac{1}{2} - 2\epsilon')$. Note that eventually all items in $R'(i)$ will be deleted. Thus,
$|R'(i)| \ge (\frac{1}{2} - 2\epsilon') \cdot (\text{number of deleted items}) \ge (\frac{1}{2} - 2\epsilon') \cdot ed(\sigma_{[i-w+1,i]})$. □

Note that $|R'(i)|$ equals the number of tokens associated with the most
recent $w$ items. Since $\mathcal{B}$ is only an $\epsilon'$-approximate data structure, the value $\hat{t}$
returned only satisfies that $(1 - \epsilon')|R'(i)| \le \hat{t} \le (1 + \epsilon')|R'(i)|$. Since we
report $\hat{ed}(\sigma_{[i-w+1,i]}) = \hat{t} / ((\frac{1}{2} - 2\epsilon')(1 - \epsilon'))$ as the estimation, we conclude with the
following approximation ratio.



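Lemma 4. $ed(\sigma_{[i-w+1,i]}) \le \hat{ed}(\sigma_{[i-w+1,i]}) \le \frac{2(1+\epsilon')}{(1/2 - 2\epsilon')(1 - \epsilon')} \cdot ed(\sigma_{[i-w+1,i]})$.

For any $\epsilon < 1$, we can set $\epsilon' = \epsilon/35$. Then,
$\frac{2(1+\epsilon')}{(1/2 - 2\epsilon')(1 - \epsilon')} \cdot ed(\sigma_{[i-w+1,i]}) \le (4 + \epsilon) \cdot ed(\sigma_{[i-w+1,i]})$.
The total space usage of the two data structures is
$O(\frac{1}{\epsilon^2} \log^2(\epsilon w) + \frac{1}{\epsilon} \log w \log(\epsilon w))$. If $\epsilon > \frac{1}{\sqrt{w}}$, $\log w = O(\frac{1}{\epsilon} \log(\epsilon w))$ and thus the
total space usage is $O(\frac{1}{\epsilon^2} \log^2(\epsilon w))$. Otherwise, we can store all items in the
window, which only requires $O(w) = O(\frac{1}{\epsilon^2})$ space.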
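The choice $\epsilon' = \epsilon/35$ can be sanity-checked numerically. The sketch below uses exact rational arithmetic and assumes the upper ratio has the form $2(1+\epsilon')/((\frac{1}{2} - 2\epsilon')(1 - \epsilon'))$, which is what Lemma 3 combined with the $(1 \pm \epsilon')$ counting error gives; the function names are hypothetical.

```python
from fractions import Fraction

def approximation_ratio(eps_prime):
    """Upper ratio obtained by chaining Lemma 3 with the (1 +/- eps') counting error."""
    eps_prime = Fraction(eps_prime)
    return 2 * (1 + eps_prime) / ((Fraction(1, 2) - 2 * eps_prime) * (1 - eps_prime))

def ratio_within_bound(eps):
    """Check that eps' = eps / 35 keeps the ratio at most 4 + eps."""
    eps = Fraction(eps)
    return approximation_ratio(eps / 35) <= 4 + eps
```

For $\epsilon = 1$ the ratio evaluates to $2520/527 \approx 4.78$, comfortably below $4 + \epsilon = 5$.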

Improving the running time. The per-item update time of the algorithm
is $O(w)$ because the algorithm checks the interval $I = [k, i-1]$ for every length
$|I| \in [w-1]$. An observation in [5] is that an approximate $\phi$-quantile of an
interval with length $|I|$, computed with a suitably smaller error parameter, is
also an $\epsilon'$-approximate $\phi$-quantile for all slightly longer intervals. Hence we
only need to check $O(\frac{1}{\epsilon'} \log w)$ intervals whose lengths grow geometrically from
1 up to $w$. Then we obtain an $\epsilon'$-approximate
quantile for every interval. Note that the query time for returning an approximate 
quantile is 0(p- log^ w), and the per-item update time of the two data structures 
is log'^ w) [6113] . We conclude with the main result of this section. 
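The idea of checking only geometrically spaced interval lengths can be sketched as follows. The growth factor $1 + \delta$ is a hypothetical parameter of this illustration (the text ties it to $\epsilon'$), and the function names are assumptions; the sketch only shows that every length up to $w$ is within a factor $1 + \delta$ of some checked length.

```python
from fractions import Fraction
import math

def checkpoint_lengths(w, delta):
    """Interval lengths to check: consecutive at first, then growing by about 1 + delta."""
    lengths = []
    length = 1
    while length <= w:
        lengths.append(length)
        # Always advance by at least one so that short lengths are all covered.
        length = max(length + 1, math.floor(length * (1 + delta)))
    return lengths

def covering_length(lengths, target):
    """Largest checked length that does not exceed target."""
    return max(l for l in lengths if l <= target)
```

With $w = 1000$ and $\delta = 1/10$ fewer than 200 lengths are checked instead of 1000, and the count grows like $\frac{1}{\delta} \log w$.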

Theorem 1. There is a deterministic (4 + e)- approximate algorithm for es- 
timating ED in a sliding window of the latest w items. The space usage is 
0{-^ log^(ew)) and the per-item update time is log'^ w). 

Remark. For the whole stream model, the state-of-the-art result is a $(2 + \epsilon)$-
approximation in [8]. They gave an improved estimator $R(i)$ as the set of indices
$j$ such that there exists $k < j$ with $|[k, j-1] \cap inv(j)| > |[k, j-1] \cap R(i)|$. In
other words, whether an index belongs to $R(i)$ or not depends on the number of
members of $R(i)$ before that index. Note that a member of $R(i)$ could become
a nonmember due to window expiration. Therefore, an index $j$ that is not a
member of $R(i)$ initially may later become a member if some of the previous
$R(i)$ members become nonmembers. This makes estimating this improved $R(i)$
difficult in the sliding window model.

4 Lower bounds for out-of-order streams 

In this section, we consider an out-of-order stream $\sigma$ consisting of a sequence of
items $\sigma(i) = (t_i, v_i)$ for $i \in [N]$, where $t_i$ and $v_i$ are the timestamp and value of
the $i$-th item, respectively. Recall that the sortedness of the stream is measured
on the derived value sequence by rearranging the items in non-decreasing order
of the timestamps. We show that even for the whole data stream model, any
randomized constant-approximate algorithm for estimating ED or LIS requires
$\Omega(N)$ space. In fact, a stronger lower bound holds for ED: any randomized
algorithm that decides whether ED equals 0 uses $\Omega(N)$ space. Our proofs follow
from reductions from two different communication problems.



4.1 Estimating ED in an out-of-order stream 



Theorem 2. Consider an out-of-order stream $\sigma$ of size $N$. Any randomized
algorithm that distinguishes between the cases that $ed(\sigma) = 0$ and that $ed(\sigma) \ge 1$
must use $\Omega(N)$ bits. Therefore, for an arbitrary constant $r \ge 1$, any randomized
$r$-approximation to $ed(\sigma)$ requires $\Omega(N)$ bits.

We prove the above lower bound by showing a reduction from the classical
communication problem Index, which has a strong communication lower bound.

The problem Index$(x, i)$ is a two-player one-way communication game.
Alice holds a binary string $x \in \{0, 1\}^n$ and Bob holds an index $i \in [n]$. In this
communication game, Alice sends one message to Bob and Bob is required to
output the $i$-th bit of $x$, i.e., $x_i$, based on the message received. A trivial protocol
is for Alice to send her whole input string $x$ to Bob, which has communication
complexity of $n$ bits. It turns out that this protocol is optimal. In particular, Alice
must communicate $\Omega(n)$ bits in any randomized protocol for Index [1].

Proof (of Theorem 2). Given an out-of-order stream with length $N$, suppose
there is a randomized algorithm $\mathcal{A}$ that can determine whether its ED equals
0 or is at least 1 using $S$ memory bits. We define a randomized protocol $\mathcal{P}$
for Index$(x, i)$ for $n = N - 1$: Alice constructs (hypothetically) an out-of-order
stream $\sigma$ with length $n$ by setting



Alice then simulates algorithm $\mathcal{A}$ on stream $\sigma$ and sends the content of the
working memory to Bob. Bob constructs another stream item $\sigma(n+1) = (2i, 3i-1)$
to continue running algorithm $\mathcal{A}$ on it and obtains the output. If the output
says $ed(\sigma) = 0$, Bob outputs 0; otherwise, Bob outputs 1.

It is not hard to see that Index$(x, i) = x_i = 0$ implies $ed(\sigma) = 0$ and
Index$(x, i) = 1$ implies $ed(\sigma) = 1$. Therefore, if algorithm $\mathcal{A}$ reports the
correct answer with high probability, the protocol $\mathcal{P}$ outputs correctly with high
probability, and thus is a valid randomized protocol for Index. In the protocol,
the number of bits communicated by Alice is at most $S$. Combining this with the
$\Omega(n) = \Omega(N)$ lower bound, we obtain that $S = \Omega(N)$, completing the proof. □
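To see how such a reduction can work end to end, the Python sketch below simulates the protocol with an exact ED computation in place of the streaming algorithm. The encoding of Alice's items is a hypothetical one chosen for this illustration, not necessarily the construction used in the proof: bit $x_j$ becomes the item $(2j-1, 3j)$ if $x_j = 1$ and $(2j-1, 3j-2)$ otherwise. It is consistent with Bob's item $(2i, 3i-1)$ and with the two implications stated above.

```python
from bisect import bisect_right
from itertools import product

def ed_by_timestamp(stream):
    """Exact ED of the values after sorting (timestamp, value) items by timestamp."""
    values = [v for _, v in sorted(stream)]
    tails = []
    for v in values:
        pos = bisect_right(tails, v)
        if pos == len(tails):
            tails.append(v)
        else:
            tails[pos] = v
    return len(values) - len(tails)

def alice_items(x):
    """Hypothetical encoding: x_j -> (2j - 1, 3j) if x_j = 1, else (2j - 1, 3j - 2)."""
    return [(2 * j - 1, 3 * j if bit else 3 * j - 2) for j, bit in enumerate(x, start=1)]

def bob_decodes(x, i):
    """Bob appends (2i, 3i - 1) and outputs 0 exactly when the ED is 0."""
    stream = alice_items(x) + [(2 * i, 3 * i - 1)]
    return 0 if ed_by_timestamp(stream) == 0 else 1

# Exhaustive check for n = 4: Bob always recovers the i-th bit.
all_correct = all(
    bob_decodes(x, i) == x[i - 1]
    for x in product([0, 1], repeat=4)
    for i in range(1, 5)
)
```

Under this encoding Alice's values are strictly increasing in timestamp order, and Bob's value $3i - 1$ lands between the two possible values of her $i$-th item, so it creates an inversion exactly when $x_i = 1$.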



4.2 Estimating LIS in an out-of-order stream 

Theorem 3. Consider an out-of-order stream $\sigma$ with size $N$. Any randomized
algorithm that outputs an $r$-approximation on $lis(\sigma)$ must use $\Omega(N/r^2)$ bits.

Proof. We prove the lower bound by considering the $t$-party set disjointness
problem Disj. The input to this communication game is a binary $t \times \ell$ matrix
$x \in \{0, 1\}^{t\ell}$, and each player $P_i$ holds one row of $x$, the 1-entries of which
indicate a subset $A_i$ of $[\ell]$. The input $x$ is called disjoint if the $t$ subsets are
pairwise disjoint, i.e., each column of $x$ contains at most one 1-entry; and it is
called uniquely intersecting if the subsets $A_i$ share a unique common element $y$
and the sets $A_i - \{y\}$ are pairwise disjoint, meaning that in $x$, except one column
with entries all equal to 1, all the other columns have at most one 1-entry. The
objective of the game is to distinguish between the two types of inputs. To
obtain the space lower bound, we only need to consider a restricted version of
Disj where, according to some probabilistic protocol, the first $t - 1$ players in
turn each send a message privately to the next player and the last player
$P_t$ outputs the answer.

An optimal lower bound of $\Omega(\ell/t)$ total communication is known for Disj even
for general randomized protocols (with constant success probability) [14], and
thus the lower bound also holds for our restricted one-way private communication
model. By giving a reduction and setting the parameters appropriately, we can
obtain the space lower bound.

Given a randomized algorithm that outputs an $r$-approximation to the LIS of
any out-of-order stream with length $N$, using $S$ memory bits, we define a simple
randomized protocol for Disj for $t = 2r = o(N)$ and $\ell = N + 1 - t = \Theta(N)$.
Let $x$ be the input $t \times \ell$ matrix. The first player $P_1$ creates an out-of-order
stream $\sigma$ by going through his row of input $R_1(x)$ and inserting a new item
$((j-1)t + 1, (\ell - j)t + 1)$ to the end of the stream whenever an entry $x_{1j}$
equals 1. He then runs the streaming algorithm on $\sigma$ and sends the content
of the memory to the second player. In general, player $P_i$ appends a new item
$((j-1)t + i, (\ell - j)t + i)$ to the stream for each nonzero entry $x_{ij}$, simulates
the streaming algorithm and communicates the updated memory state to the
next player. Finally, player $P_t$ obtains the approximated LIS of stream $\sigma$. If it is
at most $r$ he reports that the input $x$ is disjoint; else, he reports it is uniquely
intersecting. It is easy to verify that if the input $x$ is disjoint, the correct LIS
of stream $\sigma$ is 1, while if it is uniquely intersecting, the correct LIS of $\sigma$ is $t$.
Consequently, if the streaming algorithm outputs an $r$-approximation to $lis(\sigma)$
with probability at least 2/3, the protocol for Disj is correct with constant
probability, using total communication at most $(t-1)S$. Following the lower
bound for Disj, this implies $(t-1)S \ge \Omega(\ell/t)$, i.e., $S = \Omega(\ell/t^2) = \Omega(N/r^2)$.
Theorem 3 follows.
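The two LIS values claimed in the reduction can be checked on small inputs. The Python sketch below builds the stream from a 0/1 matrix, assuming the value coordinate of player $P_i$'s item for column $j$ reads $(\ell - j)t + i$, and computes the exact LIS after sorting by timestamp; the function names are hypothetical and the exact LIS stands in for the streaming algorithm.

```python
from bisect import bisect_left

def reduction_stream(x):
    """x: t-by-l 0/1 matrix as a list of rows; players append their items in turn."""
    t, l = len(x), len(x[0])
    stream = []
    for i in range(1, t + 1):          # player P_i
        for j in range(1, l + 1):      # column j of his row
            if x[i - 1][j - 1] == 1:
                stream.append(((j - 1) * t + i, (l - j) * t + i))
    return stream

def lis_by_timestamp(stream):
    """Exact LIS of the values in timestamp order (all values here are distinct)."""
    tails = []
    for _, v in sorted(stream):
        pos = bisect_left(tails, v)
        if pos == len(tails):
            tails.append(v)
        else:
            tails[pos] = v
    return len(tails)

disjoint = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 1]]
intersecting = [[1, 1, 0, 0], [0, 1, 0, 0], [0, 1, 0, 1]]
```

In timestamp order the items are grouped by column; values increase within a column and decrease from one column to the next, so the LIS equals the largest number of 1-entries in any single column.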

Remark. Actually, for deterministic algorithms, we can obtain a slightly
stronger lower bound of $\Omega(N/r)$ for $r$-approximation, by a reduction from the
Hidden-IS problem used in [10] to prove the $\Omega(\sqrt{N})$ lower bound for
approximating LIS of an in-order stream. The reduction is similar to the above, and if
we set the approximation ratio $r$ to a constant, the lower bounds become linear
in both cases. Therefore, we omit the details here.